{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(Based on the kernel [Unusual meaning map: Treating question pairs as image / surface](https://www.kaggle.com/puneetsl/unusual-meaning-map) by Puneeth Singh Ludu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_cell_guid": "97a4020f-95de-c0de-9ce2-b8e2307b4906"
   },
   "source": [
    "Unusual meaning map: Treating question pairs as image / surface\n",
    "---------------------------------------------------------------\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_cell_guid": "97a5d3c2-fa3b-452d-d629-3116968b6c1c"
   },
   "source": [
    "Other people have already written really nice exploratory kernels which helped me to write the minimal code myself. \n",
    "\n",
    "In this kernel, I have tried to extract a different type of feature from which we can learn using any algorithm which can learn via image. The basic assumption behind this exercise is to capture non-sequential closeness between words.\n",
    "\n",
    "For example:\n",
    "A Question pair has pointing arrows from each of the words of one sentence to each of the words from another sentence\n",
    "![A Question pair has pointing arrows from each of the words of one sentence to each of the words from another sentence][1]\n",
    "\n",
    "  [1]: http://image.prntscr.com/image/97e92b0357a843078b61eef5ad8a183b.png\n",
    "\n",
    "To capture this we can create NxM matrix with Word2Vec distance between each word with other. and resize the matrix just like an image to a 10x10 matrix and use this as a feature to xgboost."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This utility package imports `numpy`, `pandas`, `matplotlib` and a helper `kg` module into the root namespace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pygoose import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "import gensim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim import corpora, models, similarities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Automatically discover the paths to various data folders and compose the project structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "project = kg.Project.discover()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Identifier for storing these features on disk and referring to them later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_list_id = '3rdparty_image_similarity'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Original question sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "_cell_guid": "9a81d538-d1a1-358a-8c34-7670147aeaec"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:3: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version\n",
      "of pandas will change to not sort by default.\n",
      "\n",
      "To accept the future behavior, pass 'sort=False'.\n",
      "\n",
      "To retain the current behavior and silence the warning, pass 'sort=True'.\n",
      "\n",
      "  This is separate from the ipykernel package so we can avoid doing imports until\n"
     ]
    }
   ],
   "source": [
    "df = pd.concat([\n",
    "    pd.read_csv(project.data_dir + 'train.csv'),\n",
    "    pd.read_csv(project.data_dir + 'test.csv'),\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unique document corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentences = kg.io.load(project.preprocessed_data_dir + 'unique_questions_tokenized.pickle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_cell_guid": "7147800f-b006-e754-afe6-057975bdffba"
   },
   "source": [
    "**Creating a simple Word2Vec model from the question pair, we can use a pre-trained model instead to get better results**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "_cell_guid": "a80b9de3-48df-b6b4-2701-554f53d31d84"
   },
   "outputs": [],
   "source": [
    "model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_cell_guid": "fe010b8f-1fea-d87a-5733-8cfecc960157"
   },
   "source": [
    "**A very simple term frequency and document frequency extractor** "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "_cell_guid": "c5784c45-78b4-f124-5452-405777f0500e"
   },
   "outputs": [],
   "source": [
    "tf = dict()\n",
    "docf = dict()\n",
    "total_docs = 0\n",
    "\n",
    "for sentence in sentences:\n",
    "    total_docs += 1\n",
    "    uniq_toks = set(sentence)\n",
    "    \n",
    "    for i in sentence:\n",
    "        if i not in tf:\n",
    "            tf[i] = 1\n",
    "        else:\n",
    "            tf[i] += 1\n",
    "            \n",
    "    for i in uniq_toks:\n",
    "        if i not in docf:\n",
    "            docf[i] = 1\n",
    "        else:\n",
    "            docf[i] += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_cell_guid": "a7e3ce1a-4cfa-ea08-8703-0659ac38b4a3"
   },
   "source": [
    "Mimic the IDF function but penalize the words which have fairly high score otherwise, and give a strong boost to the words which appear sporadically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "_cell_guid": "831e8912-bfed-8bc0-478d-714286d8b618"
   },
   "outputs": [],
   "source": [
    "def idf(word):\n",
    "    return 1 - math.sqrt(docf[word] / total_docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "_cell_guid": "146df649-d4c4-434b-78f7-32c1c9299b37"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9885943458314548\n"
     ]
    }
   ],
   "source": [
    "print(idf(\"kenya\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_cell_guid": "e8693960-2062-3042-6cb4-f1affe4e4986"
   },
   "source": [
    "A simple cleaning module for feature extraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "_cell_guid": "01a2e717-4c55-0d11-b1a0-68c62a934f69"
   },
   "outputs": [],
   "source": [
    "def basic_cleaning(string):\n",
    "    string = str(string)\n",
    "    string = string.lower()\n",
    "    string = re.sub('[0-9\\(\\)\\!\\^\\%\\$\\'\\\"\\.;,-\\?\\{\\}\\[\\]\\\\/]', ' ', string)\n",
    "    string = ' '.join([i for i in string.split() if i not in [\"a\", \"and\", \"of\", \"the\", \"to\", \"on\", \"in\", \"at\", \"is\"]])\n",
    "    string = re.sub(' +', ' ', string)\n",
    "    return string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "_cell_guid": "c19639c7-73ce-115d-ece0-afad6f390c21"
   },
   "outputs": [],
   "source": [
    "def w2v_sim(w1, w2):\n",
    "    try:\n",
    "        return model.similarity(w1, w2) * idf(w1) * idf(w2)\n",
    "    except Exception:\n",
    "        return 0.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def img_feature(row):\n",
    "    s1 = row['question1']\n",
    "    s2 = row['question2']\n",
    "    t1 = list((basic_cleaning(s1)).split())\n",
    "    t2 = list((basic_cleaning(s2)).split())\n",
    "    Z = [[w2v_sim(x, y) for x in t1] for y in t2] \n",
    "    a = np.array(Z, order='C')\n",
    "    return [np.resize(a,(10,10)).flatten()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "6ea1b1e7-1d71-abeb-b6c1-eb27649c23cb"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: Call to deprecated `similarity` (Method will be removed in 4.0.0, use self.wv.similarity() instead).\n",
      "  This is separate from the ipykernel package so we can avoid doing imports until\n",
      "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
      "  if np.issubdtype(vec.dtype, np.int):\n"
     ]
    }
   ],
   "source": [
    "s = df\n",
    "img = s.apply(img_feature, axis=1, raw=True)\n",
    "pix_col = [[] for y in range(100)] \n",
    "\n",
    "for k in img.iteritems():\n",
    "    for f in range(len(list(k[1][0]))):\n",
    "        pix_col[f].append(k[1][0][f])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_cell_guid": "3f70ee65-961b-e504-6319-1dc2f1fbdb9b"
   },
   "source": [
    "**Extracting Features**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "_cell_guid": "734650f7-3d1b-3368-e7d0-7177e77f0927"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df_X = pd.DataFrame()\n",
    "for g in range(len(pix_col)):\n",
    "    df_X[f'img{g:03d}'] = pix_col[g]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_train = np.sum(df_X[:404290].values, axis=1).reshape(-1, 1)\n",
    "X_test = np.sum(df_X[404290:].values, axis=1).reshape(-1, 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X_train: (404290, 1)\n",
      "X_test:  (2345796, 1)\n"
     ]
    }
   ],
   "source": [
    "print('X_train:', X_train.shape)\n",
    "print('X_test: ', X_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "feature_names = [\n",
    "    'image_similarity'\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "project.save_features(X_train, X_test, feature_names, feature_list_id)"
   ]
  }
 ],
 "metadata": {
  "_change_revision": 1,
  "_is_fork": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
